About the project

Our era of data - larger than ever and more complex than before - requires several skills from statisticians and other data scientists. In this course I am about to learn some new skills to be able to work with data.

In this course I will learn to

  • use open data science tools
    • R
    • R Markdown
    • GitHub

https://github.com/lottaimmeli/IODS-project


Chapter 2. Regression and model validation

This week I am doing regression analysis and model validation. But first I had to wrangle the original data. That was pretty difficult for me, but it was good practice, and I just need to practice a lot more.

Description of the dataset

# After wrangling the data and writing it out to a folder, I could use my own wrangled data for the further analysis...

students2014 <- read.csv(file="~/Documents/GitHub/IODS-project/Data/learning2014.csv", header = TRUE, sep=",")

# Looking at the data structure

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
head(students2014)
##   gender Age Attitude Points     deep  stra     surf
## 1      F  53       37     25 3.583333 3.375 2.583333
## 2      M  55       31     12 2.916667 2.750 3.166667
## 3      F  49       25     24 3.500000 3.625 2.250000
## 4      M  53       35     10 3.500000 3.125 2.250000
## 5      M  49       37     22 3.666667 3.625 2.833333
## 6      F  38       38     21 4.750000 3.625 2.416667

This is a dataset of 166 observations (students) with 7 variables: gender, age, global attitude toward statistics (Attitude), exam points (Points) and points related to different aspects of learning (deep, strategic and surface learning).

Next, a graphical overview of the data and summaries of the variables.

pairs(students2014[-1], col=students2014$gender)

summary(students2014)
##  gender       Age           Attitude         Points           deep      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   1st Qu.:3.333  
##          Median :22.00   Median :32.00   Median :23.00   Median :3.667  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72   Mean   :3.680  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75   3rd Qu.:4.083  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00   Max.   :4.917  
##       stra            surf      
##  Min.   :1.250   Min.   :1.583  
##  1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.188   Median :2.833  
##  Mean   :3.121   Mean   :2.787  
##  3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :5.000   Max.   :4.333

I think this method is not very good: it's very difficult to interpret the data like this. I need a better graph.

# Use the packages GGally and ggplot2 to get some help with the graphical overview

library(GGally)
library(ggplot2)

p <- ggpairs(students2014, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))

p

This is a better view of the data. All the variables (except for age, where skewness > 0) are pretty nicely normally distributed. This also shows the correlations between the variables. There doesn't seem to be a very strong correlation between any of them, since the correlation coefficients are between -0.3 and 0.4.

Making a regression model

# Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

model <- lm(Points ~ gender + Attitude + stra, data=students2014)
summary(model)
## 
## Call:
## lm(formula = Points ~ gender + Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97982    2.40303   3.737 0.000258 ***
## genderM     -0.22362    0.92477  -0.242 0.809231    
## Attitude     0.35100    0.05956   5.893 2.13e-08 ***
## stra         0.89107    0.54409   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

I chose to study the association of exam points (target variable) with gender, attitude and strategic learning (explanatory variables). Here I can see that Attitude is the only one of these three explanatory variables that is statistically significant (p-value < 0.05). The large t value (5.893) also reflects this significance.

Next I make a regression model with only the one significant explanatory variable “Attitude”.

model2 <- lm(Points ~ Attitude, data=students2014)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Here I get the answer that the "Attitude" estimate is 0.35 and the p-value stays low (below 0.05, meaning it is significant). In other words, when Attitude increases by 1 unit, the predicted exam points increase by 0.35.
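As a quick sanity check of this interpretation, plugging the fitted coefficients from the summary above into the regression equation shows that two students whose Attitude differs by one unit differ in predicted Points by exactly the slope:

```r
# Coefficients copied from the summary(model2) output above
intercept <- 11.63715
slope     <- 0.35255

# Predicted exam points for Attitude values 30 and 31
p30 <- intercept + slope * 30
p31 <- intercept + slope * 31

p31 - p30  # 0.35255, i.e. the slope
```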

Next I need to explain and interpret the multiple R-squared of the model. R-squared is a statistical measure of how close the data are to the fitted regression line, i.e. how well the model fits my data.

The definition of R-squared (the coefficient of determination) is the percentage of the response variable variation that is explained by a linear model.

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%.

Here in the summary I can see that the Multiple R-squared is 0.1906, i.e. about 19%, so my model explains only about one fifth of the variation in exam points around their mean.

“In general, the higher the R-squared, the better the model fits my data. R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.”

“R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!”
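To make the definition concrete, here is a minimal sketch (using R's built-in mtcars data, not the course data) showing that multiple R-squared is exactly one minus the residual variation over the total variation:

```r
# Illustrative sketch on built-in mtcars data, not the learning2014 data
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(fit)^2)                    # unexplained variation
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation around the mean
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)    # TRUE
```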

Diagnostic plots to explain the assumptions of the model

There are several assumptions behind a linear regression model. With the help of the following plots I can analyse the residuals of my model and see how well my linear regression model is working here, or whether there are some serious problems with it.

plot(model2, which= c(2,1,5))

Residuals vs fitted plot (homoscedasticity)

This plot shows whether the errors/residuals have constant variance. I can see equally spread residuals around a horizontal line, without distinct patterns. This is a good indication!

Q-Q plot (normality)

With the Q-Q plot I can check whether the residuals are normally distributed. Here I can see that the points are very close to the line, except at the upper and lower tails where there is some deviation. However, I think this is still reasonable and I can interpret the errors as normally distributed.

Residuals vs Leverage

This plot helps me to check whether I have outliers in my data that are influential in my linear regression model. In my analysis there are no such cases: all my observations are inside the Cook's distance lines (I cannot even see them).


Chapter 3. Logistic regression.

Reading the data

This week I'm learning about logistic regression.

Before this analysis part I have been doing some data wrangling and prepared the data so that I am able to analyse it.

# Opening the data from the file and checking how it looks

alc <- read.csv("~lottaimmeli/Documents/GitHub/IODS-project/Data/alc.csv", header = TRUE, sep = ",")

About the data

This data is about student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. There were originally two datasets provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).

In my dataset I have combined the information of these two datasets: I have observations of the 382 students who were present in both the Mathematics and the Portuguese dataset. In the original datasets there were 33 variables. In addition, I have calculated some new variables: 'alc_use', which corresponds to the average workday and weekend alcohol consumption, and 'high_use', which is TRUE if 'alc_use' is over 2.
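The wrangling itself happened before this analysis, but roughly the new variables were computed like this (a sketch with a toy data frame, where Dalc and Walc stand for the workday and weekend consumption scores):

```r
library(dplyr)

# Toy data standing in for the joined dataset
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))

toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of workday and weekend use
         high_use = alc_use > 2)        # TRUE when the average exceeds 2

toy$alc_use   # 1.0 3.5 5.0
toy$high_use  # FALSE TRUE TRUE
```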

You can find more information about the original data set and the variables here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

dim(alc)
## [1] 382  37
colnames(alc)
##  [1] "school"      "sex"         "age"         "address"     "famsize"    
##  [6] "Pstatus"     "Medu"        "Fedu"        "Mjob"        "Fjob"       
## [11] "reason"      "nursery"     "internet"    "guardian"    "traveltime" 
## [16] "studytime"   "failures"    "schoolsup"   "famsup"      "paid"       
## [21] "activities"  "higher"      "romantic"    "famrel"      "freetime"   
## [26] "goout"       "Dalc"        "Walc"        "health"      "absences"   
## [31] "G1"          "G2"          "G3"          "alc_use"     "high_use"   
## [36] "probability" "prediction"
summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use        probability     prediction     
##  Mode :logical   Min.   :0.2883   Mode :logical  
##  FALSE:268       1st Qu.:0.3597   FALSE:272      
##  TRUE :114       Median :0.4170   TRUE :110      
##                  Mean   :0.4555                  
##                  3rd Qu.:0.5197                  
##                  Max.   :0.9398

Choosing the variables to study the relationship between high or low alcohol consumption.

The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I choose 4 interesting variables (failures, absences, famrel and higher) and state the hypotheses.

  1. The first variable, "failures", tells the number of past class failures. I think there might be a correlation between higher alcohol consumption and more class failures.

  2. "Absences" corresponds to the number of school absences. I am interested to find out whether those who drink more have more school absences.

  3. “Famrel” corresponds to the quality of family relationships. Maybe those who have poor family relationships consume more alcohol?

  4. "Higher" indicates whether these students want to pursue higher education (yes or no). Maybe those who want to educate themselves in the future will study more and drink less.

Numerically and graphically exploring the distributions of my chosen variables and their relationships with alcohol consumption.

  1. First, is there a relationship between alcohol consumption and the number of past class failures?
# Numerically exploring the relationship between high_use and failures: table and geom_count plot

mytable <- table(high_use = alc$high_use, number_of_past_class_failures = alc$failures)

addmargins(mytable)
##         number_of_past_class_failures
## high_use   0   1   2   3 Sum
##    FALSE 244  12  10   2 268
##    TRUE   90  12   9   3 114
##    Sum   334  24  19   5 382
library(ggplot2)

failure1 <- ggplot(alc, aes(x=high_use, y = failures))
failure1 + geom_count()

In this table and plot I can see that the ones with low alcohol use seem to proportionally have fewer class failures compared to the students with high alcohol use. As seen here, the ones who consume less alcohol more often have no past class failures.

  2. Is there some relationship between alcohol consumption and school absences?
# Making a boxplot
g <-ggplot(alc, aes(x=high_use, y = absences))

g2 <- g + geom_boxplot() + ggtitle("Absences versus high alcohol use")
g2 

It looks like there might be a correlation between high use of alcohol and more school absences.

  3. How about the relationship between alcohol consumption and family relationships?
# Making a boxplot

g_fam <- ggplot(alc, aes(x = high_use, y = famrel))

g_fam2 <- g_fam + geom_boxplot() 
g_fam2

It seems that those with low alcohol use rate their family relationships as better (higher points). But which one is the cause and which the effect? Do students with bad relationships at home start to use more alcohol, or is everything going well at home until they start to drink more and the relationships at home become more problematic?

  4. Relationship between alcohol consumption and future educational plans
table_plans <- table(high_use= alc$high_use, wants_high_education = alc$higher)
table_plans
##         wants_high_education
## high_use  no yes
##    FALSE   9 259
##    TRUE    9 105
round(prop.table(table_plans) * 100, 1)
##         wants_high_education
## high_use   no  yes
##    FALSE  2.4 67.8
##    TRUE   2.4 27.5

It seems that those who consume more alcohol less often have plans to pursue higher education.
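The cell percentages above mix the very different group sizes together; row-wise proportions (margin = 1) make the comparison more direct. Here is a sketch rebuilding the table from the counts above:

```r
# Counts copied from the cross-tabulation above
tab <- matrix(c(9, 259, 9, 105), nrow = 2, byrow = TRUE,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              wants_high_education = c("no", "yes")))

# Proportions within each row
round(prop.table(tab, margin = 1) * 100, 1)
# FALSE row: no 3.4, yes 96.6; TRUE row: no 7.9, yes 92.1
```

So about 7.9% of the high-use students do not plan on higher education, compared to 3.4% of the low-use students.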

Logistic regression

Now I want to statistically explore the relationship between my four chosen variables and the binary high/low alcohol consumption variable as the target variable. Here is the summary of my fitted model.

# Fit the model with glm; the first model (m) includes the intercept, and in m2 I took the intercept away

m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3593  -0.7890  -0.6902   1.1782   1.8993  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.12562    0.74528  -0.169 0.866143    
## failures     0.43484    0.19618   2.217 0.026653 *  
## absences     0.08341    0.02272   3.671 0.000241 ***
## famrel      -0.22708    0.12502  -1.816 0.069322 .  
## higheryes   -0.36274    0.55177  -0.657 0.510924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 436.43  on 377  degrees of freedom
## AIC: 446.43
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    failures    absences      famrel   higheryes 
## -0.12562333  0.43483926  0.08340567 -0.22707832 -0.36273561
m2 <- glm(high_use ~ failures + absences + famrel + higher - 1, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher - 
##     1, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3593  -0.7890  -0.6902   1.1782   1.8993  
## 
## Coefficients:
##           Estimate Std. Error z value Pr(>|z|)    
## failures   0.43484    0.19618   2.217 0.026653 *  
## absences   0.08341    0.02272   3.671 0.000241 ***
## famrel    -0.22708    0.12502  -1.816 0.069322 .  
## higherno  -0.12562    0.74528  -0.169 0.866143    
## higheryes -0.48836    0.51592  -0.947 0.343857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 529.56  on 382  degrees of freedom
## Residual deviance: 436.43  on 377  degrees of freedom
## AIC: 446.43
## 
## Number of Fisher Scoring iterations: 4
coef(m2)
##    failures    absences      famrel    higherno   higheryes 
##  0.43483926  0.08340567 -0.22707832 -0.12562333 -0.48835893

Logistic regression model summary interpretation: as seen above, the p-value is low (< 0.05) for failures and absences, while famrel is borderline (p = 0.069). The coefficients for failures and absences are positive (failures 0.43, absences 0.08), meaning that more failures and more absences predict higher alcohol use. On the other hand, family relationship quality has a negative coefficient (-0.23), suggesting that better family relationships have a protective effect against high alcohol use.

The future education plan (plans to get a higher education) also has a negative coefficient (-0.36), but it is not statistically significant.

Presenting and interpreting the coefficients of the model as odds ratios and providing confidence intervals for them.

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.2.0 --
## √ tibble  1.3.4     √ purrr   0.2.4
## √ tidyr   0.7.2     √ dplyr   0.7.4
## √ readr   1.1.1     √ stringr 1.2.0
## √ tibble  1.3.4     √ forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)

# Compute odds ratios and confidence intervals
OR <- coef(m) %>% exp
CI <- exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.8819470 0.1995909 3.776889
## failures    1.5447147 1.0499664 2.282173
## absences    1.0869827 1.0418018 1.139025
## famrel      0.7968584 0.6230747 1.019051
## higheryes   0.6957704 0.2372414 2.121797

In this table one can see the coefficients of the model as odds ratios together with their confidence intervals. The odds ratio for failures is 1.54 (CI 1.05-2.28) and for absences 1.09 (1.04-1.14), meaning that each additional past class failure multiplies the odds of high alcohol consumption by 1.54, and each additional absence multiplies them by 1.09. Better family relationships make high alcohol consumption less likely: each one-point improvement multiplies the odds by 0.80.

The higher educational plan is not significant, since its odds ratio is 0.70 and the CI is 0.2-2.1 (the number 1 is included in the CI).
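The odds ratios in this table are simply the exponentiated log-odds coefficients from the model summary; for example, for failures:

```r
# Log-odds coefficient for failures, copied from coef(m) above
b_failures <- 0.43483926

exp(b_failures)  # about 1.5447, the odds ratio reported above
```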

These findings support my earlier hypotheses.

Exploring the predictive power of my model.

I'm now exploring the predictive power of my model. (Note that the model m still includes all four explanatory variables, not only the statistically significant ones.)

# Predict the probability of high use and add the new variable to the alc dataset. Then use the probability to make a prediction of high_use. Looking at the last 10 observations

probabilities <- predict(m, type= "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, famrel, high_use, probability, prediction) %>% tail(10)
##     failures absences famrel high_use probability prediction
## 373        1        0      4    FALSE   0.2765114      FALSE
## 374        1        7      5     TRUE   0.3531843      FALSE
## 375        0        1      5    FALSE   0.1764851      FALSE
## 376        0        6      4    FALSE   0.2898242      FALSE
## 377        1        2      5    FALSE   0.2646186      FALSE
## 378        0        2      4    FALSE   0.2262058      FALSE
## 379        2        2      2    FALSE   0.5234763       TRUE
## 380        0        3      1    FALSE   0.3857482      FALSE
## 381        0        4      2     TRUE   0.3523118      FALSE
## 382        0        2      4     TRUE   0.2262058      FALSE
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.25392670 0.04450262 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000
gpred <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

gpred + geom_point()

# the training error

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

The proportion of incorrectly classified observations (i.e. the training error) is 0.30, or 30%. I think the percentage is pretty high; I can't really consider my model very good when one third of the observations are incorrectly classified.

Then I also made a 10-fold cross-validation, where I got approximately the same number (about 0.30, i.e. 30%) as the average proportion of wrong predictions.

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3010471
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2984293

Chapter 4. Clustering and classification

The theme for this week is clustering and classification. This is something totally new for me, so let's see what I will learn this week.

About the data

In this exercise we are examining the Boston dataset from R MASS package. This dataset is about housing values in suburbs of Boston. The data frame has 506 observations and 14 variables.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Calling for the Boston dataset from the MASS package
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

So we have 506 suburbs of Boston, and here are the explanations of the different variables:

  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox: nitrogen oxides concentration (parts per 10 million)
  • rm: average number of rooms per dwelling
  • age: proportion of owner-occupied units built prior to 1940
  • dis: weighted mean of distances to five Boston employment centres
  • rad: index of accessibility to radial highways
  • tax: full-value property-tax rate per $10,000
  • ptratio: pupil-teacher ratio by town
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat: lower status of the population (percent)
  • medv: median value of owner-occupied homes in $1000s

# Calling for the different packages I might use in this exercise
library(ggplot2)
library(GGally)
library(tidyverse)
library(corrplot)
## corrplot 0.84 loaded
# Checking the summary of the Boston dataset
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here is the summary of the data. Chas is a binary variable. The other variables, except for crim and zn, seem to be fairly symmetrically distributed (mean and median more or less close to each other).

Looking for the correlations

Let's also make a graph.

#graphical overview
pairs(Boston)

With the pairs plot it's not very easy to see any correlations. Let's study the correlations between the different variables with a correlation plot.

# Making a correlation matrix and drawing a correlation plot to visualise it better and for easier interpretation

cormatrix <- cor(Boston)
cormatrix %>% round(digits=2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cormatrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

There is a strong negative correlation (big red circles) between dis and nox, dis and age, and dis and indus. This means that moving further from the Boston employment centres, the nitrogen oxide concentration, the proportion of owner-occupied units built prior to 1940 and the proportion of industrial land all go down. This makes sense.

Also, lower status of the population (lstat) and median value of owner-occupied homes (medv) have a strong negative correlation: when the percentage of lower-status population gets bigger, the median value of owner-occupied homes gets smaller. This also is understandable.

Rad and tax have a strong positive correlation, meaning that when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises. Why not?

Getting ready for the analysis

Let’s move further with the analysis.

I need to standardise the dataset so that all the variables are on a comparable scale (mean 0, standard deviation 1); scaling does not change the shape of the distributions, but it is needed for the distance-based methods used below. I print the summary of the scaled data set.

#Standardising the dataset
boston_scaled <- scale(Boston)

#printing out summaries of the scaled data
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#My boston_scaled data is a matrix, so I convert it to a data frame for the later steps
class(boston_scaled)
## [1] "matrix"
boston_scaled<- as.data.frame(boston_scaled)

Now we have scaled the data: as seen in the summary, all the means are zero and the medians are close to zero. Note that scaling does not make the variables normally distributed; it centres them and puts them on a comparable scale, so that the data can be fitted in a model.
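To convince myself of what scale() does, the standardisation can be reproduced by hand for one column: scale() subtracts the column mean and divides by the column standard deviation. A small sanity check (a sketch of mine, assuming the Boston data is still loaded):

#scale() computes (x - mean(x)) / sd(x) column by column
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

#this should match the scaled column up to floating point error
all.equal(manual_crim, as.numeric(scale(Boston)[, "crim"]))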

Next I will change the continuous crime rate variable in my data set to be a categorical variable. I want to cut the crim variable by quantiles to get the high, low and middle rates of crime into their own categories. I drop the old crim variable from the dataset and replace it with the new crime variable.

#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#and create a categorical variable crime

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label= c("low", "med_low", "med_high", "high"))


table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
library(dplyr)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
dim(boston_scaled)
## [1] 506  14

Now I need to divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

#Here I make the train and the test sets. I choose 80% of the observations to the train set and the rest to the test set
dim(boston_scaled)
## [1] 506  14
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)

#ind is a plain vector, so dim() returns NULL; length(ind) would show the number of sampled rows
dim(ind)
## NULL
#create the train set

train <- boston_scaled[ind,]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 0.585 -0.487 -0.487 -0.487 ...
##  $ indus  : num  1.567 -0.915 -0.18 -0.164 -0.548 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  0.5981 -1.1106 -0.0923 -0.0664 -0.5324 ...
##  $ rm     : num  0.0589 0.6296 -0.825 -0.5873 0.2012 ...
##  $ age    : num  1.035 -1.246 0.324 0.161 -0.578 ...
##  $ dis    : num  -0.7238 0.7625 0.0712 -0.6257 0.354 ...
##  $ rad    : num  -0.637 -0.637 -0.637 -0.408 -0.522 ...
##  $ tax    : num  0.171 -0.755 -0.618 0.141 -0.719 ...
##  $ ptratio: num  1.2677 0.2515 -0.0257 -0.3028 0.5286 ...
##  $ black  : num  0.441 0.441 0.435 -0.198 0.441 ...
##  $ lstat  : num  -0.055 -1.031 -0.161 0.38 -0.764 ...
##  $ medv   : num  -0.319 0.594 -0.689 -0.232 0.138 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 3 1 2 2 2 2 4 4 4 4 ...
#create the test set
test <- boston_scaled[-ind,]
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  0.0487 0.0487 0.0487 -0.4872 -0.4872 ...
##  $ indus  : num  -0.476 -0.476 -0.476 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.265 -0.265 -0.265 -0.144 -0.144 ...
##  $ rm     : num  -0.399 -0.392 -0.563 -0.478 -0.268 ...
##  $ age    : num  0.615 0.509 -1.051 -0.241 0.566 ...
##  $ dis    : num  1.328 1.155 0.786 0.433 0.317 ...
##  $ rad    : num  -0.522 -0.522 -0.522 -0.637 -0.637 ...
##  $ tax    : num  -0.577 -0.577 -0.577 -0.601 -0.601 ...
##  $ ptratio: num  -1.5 -1.5 -1.5 1.18 1.18 ...
##  $ black  : num  0.329 0.441 0.371 0.441 0.256 ...
##  $ lstat  : num  0.6227 0.0864 0.4281 -0.6152 -0.3351 ...
##  $ medv   : num  -0.395 -0.395 -0.0906 -0.2319 -0.4711 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 2 2 3 3 3 3 3 3 2 ...
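One caveat about the split above: sample() draws a fresh random subset every time the document is knitted, so the exact rows in the train and test sets change between runs. A minimal sketch of a reproducible version (the seed value 2017 is an arbitrary choice of mine):

#fixing the seed makes the split reproducible
set.seed(2017)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8) #size is truncated to a whole number (404)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]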

Linear discriminant analysis (LDA)

Now I’m making a linear discriminant analysis on the train set. I use the categorical crime rate as the target variable and all the other variables as predictors. I draw the LDA (bi)plot of the model.

# fit the linear discriminant analysis on the train set: crime as the target variable and all the other variables as predictors
# lda() comes from the MASS package
library(MASS)

lda.fit <- lda(formula = crime ~ ., data = train)


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# drawing a plot of the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Then I predict the classes with the LDA model on my test data and cross tabulate the results with the crime categories from the test set.

#saving the crime categories from the test set 
correct_classes <- test$crime

library(dplyr)


#removing the categorial crime variable from the test dataset

test <- dplyr::select(test, -crime)

#Predicting the values with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

#cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       8        0    0
##   med_low    3      19        3    0
##   med_high   0      12       12    1
##   high       0       0        0   29

Here I can see how well my model predicts. It works well for the high crime rates, but it makes some errors predicting the other classes. The same phenomenon was visible in the plot of the train set.
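The cross tabulation can also be condensed into a single number: the overall accuracy is the sum of the diagonal of the table divided by the total. A small sketch using the objects above:

#overall classification accuracy from the confusion table
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)

#equivalently, comparing the vectors directly
mean(correct_classes == lda.pred$class)

With the table above this gives (15 + 19 + 12 + 29) / 102 ≈ 0.74, though the exact number changes with every random train/test split.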

Distance measures and clustering

Next I move on to clustering and distance measures. I use the Euclidean distance, which is probably the most common one. K-means is an old and widely used clustering method; kmeans() computes the distances automatically, but I have to choose the number of clusters myself. I first tried the model with 4 clusters, but it seems to me that 3 clusters works better.


#Reloading the Boston dataset and standardising it (the variables have to be on a comparable scale for the distance calculations)
dim(Boston)
## [1] 506  14
scale_Boston2 <- scale(Boston)

scale_Boston2 <- as.data.frame(scale_Boston2)


#Calculating the distances with the Euclidean distance

dist_eu <- dist(scale_Boston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
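The Euclidean distance that dist() computes is simply the square root of the sum of squared differences over all the variables. A minimal check of mine against the first entry of the distance object (observations 1 and 2):

#Euclidean distance between observations 1 and 2, computed by hand
d12 <- sqrt(sum((scale_Boston2[1, ] - scale_Boston2[2, ])^2))

#the same value is the first entry of the dist() object
all.equal(d12, as.numeric(dist_eu)[1])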
# k-means clustering
km <- kmeans(scale_Boston2, centers = 3)


# plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:6], col = km$cluster)

pairs(scale_Boston2[7:13], col = km$cluster)

Next I investigate the optimal number of clusters. There are many ways to find it, but here I will use the total within-cluster sum of squares (WCSS) and visualise the result with a plot.

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})

# visualize the results (qplot comes from ggplot2)
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is where the total WCSS drops radically; in the plot above this happens around k = 2. So the optimal number of clusters would be 2. Next I run the algorithm again with two clusters.
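One caveat: kmeans() starts from a random initialisation, so the WCSS curve can wiggle a bit between runs. Fixing the seed and using several random starts (the nstart argument) gives a more stable curve; the values 2017 and 25 below are just illustrative choices of mine:

#a more stable version of the WCSS calculation
set.seed(2017)
twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, centers = k, nstart = 25)$tot.withinss})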

Clustering and Interpretation

The full pairs plot is a bit difficult to interpret, so I also made two smaller plots to be able to study some of the relationships more closely. In the first plot (where all the variables are included) I can see that the variable chas doesn’t follow any pattern with any of the other variables. Many other pairs don’t follow any nice pattern either. However, I think I can see a negative correlation between indus-dis, nox-dis and dis-lstat, and a positive correlation between indus-nox, age-nox and age-lstat.

# k-means clustering
km <- kmeans(scale_Boston2, centers = 2)

# plot the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:8], col = km$cluster)

pairs(scale_Boston2[6:13], col = km$cluster)



Chapter 5. Dimensionality reduction techniques

Describing the data set

This week I’m learning about dimensionality reduction techniques. I will be working with the “human” data. The dataset originates from the United Nations Development Programme and is about measuring the development of a country with the Human Development Index (HDI). More information about the HDI can be found here: http://hdr.undp.org/en/content/human-development-index-hdi and more about the calculation of the HDI here: http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf

I’ve been doing some data wrangling before the analysis, and here is a brief summary of my data.

human <- read.table(file="~lottaimmeli/Documents/GitHub/IODS-project/Data/human.csv", header = TRUE, row.names=1, sep = ",")

dim(human)
## [1] 155   8
colnames(human)
## [1] "Edu.Ratio"  "Labo.Ratio" "Life.Exp"   "Edu.Exp"    "GNI"       
## [6] "Mat.Mor"    "Ado.Birth"  "Rep.Per"
summary(human)
##    Edu.Ratio        Labo.Ratio        Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Rep.Per     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
head(human)
##             Edu.Ratio Labo.Ratio Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389  0.8908297     81.6    17.5 64992       4       7.8
## Australia   0.9968288  0.8189415     82.4    20.2 42261       6      12.1
## Switzerland 0.9834369  0.8251001     83.0    15.8 56431       6       1.9
## Denmark     0.9886128  0.8840361     80.2    18.7 44025       5       5.1
## Netherlands 0.9690608  0.8286119     81.6    17.9 45435       6       6.2
## Germany     0.9927835  0.8072289     80.9    16.5 43919       7       3.8
##             Rep.Per
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9

I have data on 155 countries and 8 variables for each country. The row names give the country names, and the variables are indicators of each country’s development.

Empowerment:
Edu.Ratio = Proportion of females with at least secondary education / proportion of males with at least secondary education
Labo.Ratio = Proportion of females in the labour force / proportion of males in the labour force
Rep.Per = Percentage of female representatives in parliament

Health and knowledge:
Life.Exp = Life expectancy at birth
Edu.Exp = Expected years of schooling
GNI = Gross National Income per capita
Mat.Mor = Maternal mortality ratio
Ado.Birth = Adolescent birth rate

#looking at the structure of the data

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu.Ratio : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.Ratio: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI       : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor   : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Rep.Per   : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(tidyverse)
library(dplyr)
library(ggplot2)
library(GGally)

#Also making a graphical overview of the data
gather(human) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

h <- ggpairs(human, mapping = aes(alpha=0.5), lower=list(combo =wrap("facethist", bins=20)))
h

Above is a graphical overview and a summary of the data. All the variables are numeric: GNI and Mat.Mor are integer-valued and the rest are continuous. Only the Edu.Exp variable looks approximately normally distributed.

Let’s study the relationship between the variables with correlation matrix.

#making a correlation matrix and drawing a correlation plot to visualise it better and make interpretation easier

library(corrplot)
library(tidyverse)

correlation <- cor(human)
correlation %>% round(digits=2)
##            Edu.Ratio Labo.Ratio Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth
## Edu.Ratio       1.00       0.01     0.58    0.59  0.43   -0.66     -0.53
## Labo.Ratio      0.01       1.00    -0.14    0.05 -0.02    0.24      0.12
## Life.Exp        0.58      -0.14     1.00    0.79  0.63   -0.86     -0.73
## Edu.Exp         0.59       0.05     0.79    1.00  0.62   -0.74     -0.70
## GNI             0.43      -0.02     0.63    0.62  1.00   -0.50     -0.56
## Mat.Mor        -0.66       0.24    -0.86   -0.74 -0.50    1.00      0.76
## Ado.Birth      -0.53       0.12    -0.73   -0.70 -0.56    0.76      1.00
## Rep.Per         0.08       0.25     0.17    0.21  0.09   -0.09     -0.07
##            Rep.Per
## Edu.Ratio     0.08
## Labo.Ratio    0.25
## Life.Exp      0.17
## Edu.Exp       0.21
## GNI           0.09
## Mat.Mor      -0.09
## Ado.Birth    -0.07
## Rep.Per       1.00
corrplot.mixed(correlation, lower.col = "black", number.cex = .6)

Here it can be seen that the percentage of female representatives in parliament (Rep.Per) and the ratio of females to males in the labour force (Labo.Ratio) don’t seem to have strong correlations with any of the other variables.

The maternal mortality ratio (Mat.Mor) and life expectancy have a strong negative correlation, meaning that when the maternal mortality ratio gets higher, life expectancy gets lower, which makes sense. The adolescent birth rate (Ado.Birth) also has a strong negative correlation with life expectancy. A higher education ratio and GNI seem to be positively associated with life expectancy.

Principal component analysis (PCA)

First I perform the PCA on the non-standardised data.

Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.
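The SVD connection can be sketched directly: for a column-centred data matrix X = U D V', the principal component loadings are the columns of V, and the standard deviations reported by prcomp() are the singular values divided by sqrt(n - 1). A small sanity check of mine (assuming the human data is loaded as above):

#centre the data only, as prcomp() does by default
X <- scale(as.matrix(human), center = TRUE, scale = FALSE)
sv <- svd(X)

#singular values / sqrt(n - 1) should match prcomp()'s standard deviations
all.equal(sv$d / sqrt(nrow(X) - 1), prcomp(human)$sdev)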

#Making PCA with SVD method
pca_human <- prcomp(human)
sum_pca_human<-summary(pca_human)
sum_pca_human
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
pca_pr <- round(100*sum_pca_human$importance[2, ], digits = 1)
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")


biplot(pca_human, choices = 1:2, cex= c(0.8,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of non-scaled human data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Now almost all the variables are gathered in one corner and I get only one arrow. The other arrows have zero length and an indeterminate angle, so they are skipped. PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance; without scaling, my biplot therefore looks like this. GNI has by far the largest variance, so it dominates the plot.

Standardisation of the data is clearly a good idea, so next I’m going to standardise the data and do the PCA again.

#Standardise the data and convert the matrix to a data frame for the further analysis.

human_scaled <- scale(human)
str(human_scaled)
##  num [1:155, 1:8] 0.639 0.596 0.54 0.562 0.481 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
##   ..$ : chr [1:8] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" ...
##  - attr(*, "scaled:center")= Named num [1:8] 8.53e-01 7.07e-01 7.17e+01 1.32e+01 1.76e+04 ...
##   ..- attr(*, "names")= chr [1:8] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" ...
##  - attr(*, "scaled:scale")= Named num [1:8] 2.42e-01 1.99e-01 8.33 2.84 1.85e+04 ...
##   ..- attr(*, "names")= chr [1:8] "Edu.Ratio" "Labo.Ratio" "Life.Exp" "Edu.Exp" ...
summary(human_scaled)
##    Edu.Ratio         Labo.Ratio         Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Rep.Per       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
class(human_scaled)
## [1] "matrix"
human_scaled<- as.data.frame(human_scaled)

#Make the PCA again with the SVD method


pca_human_s <- prcomp(human_scaled)

sum_pca_human_s<-summary(pca_human_s)
pca_pr_s <- round(100*sum_pca_human_s$importance[2, ], digits = 1)
pc_lab<-paste0(names(pca_pr_s), " (", pca_pr_s, "%)")

sum_pca_human_var_s<-sum_pca_human_s$sdev^2
sum_pca_human_var_s
## [1] 4.2883701 1.2989625 0.7657100 0.6066276 0.4381862 0.2876242 0.2106805
## [8] 0.1038390
biplot(pca_human_s, choices = 1:2, cex= c(0.5,1.0), col=c("coral", "black"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA plot of scaled human data")

The standardisation helps a lot. Now the variables are on a comparable scale and GNI (which had the largest variance) no longer runs over the other variables.

Edu.Exp, GNI, Edu.Ratio and Life.Exp are situated together and their arrows share a small angle, meaning that these variables are highly positively correlated. The arrows of Mat.Mor and Ado.Birth point in the opposite direction, meaning that they are strongly negatively correlated with the previously mentioned features. All of these arrows form a large angle with Labo.Ratio and Rep.Per, meaning there is little correlation between them.

The angle between a feature and a PC axis can be interpreted as the correlation between the two: a small angle means a high positive correlation.

The lengths of the arrows are proportional to the standard deviations of the features, and they seem fairly similar across the variables.
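The angle interpretation can also be checked numerically: with standardised data, the correlation between a feature and a principal component corresponds to the projection of the feature’s arrow onto that PC axis. A small sketch using the objects above:

#correlation of each scaled feature with the first two principal components
round(cor(human_scaled, pca_human_s$x[, 1:2]), 2)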

Multiple Correspondence Analysis (MCA)

Next I will be doing some MCA. For this I need the FactoMineR package.

library(FactoMineR)

From this package I will be using the tea data. Now let’s see what the data looks like.

data(tea)
colnames(tea)
##  [1] "breakfast"        "tea.time"         "evening"         
##  [4] "lunch"            "dinner"           "always"          
##  [7] "home"             "work"             "tearoom"         
## [10] "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"           
## [16] "how"              "where"            "price"           
## [19] "age"              "sex"              "SPC"             
## [22] "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"         
## [28] "diuretic"         "friendliness"     "iron.absorption" 
## [31] "feminine"         "sophisticated"    "slimming"        
## [34] "exciting"         "relaxing"         "effect.on.health"
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

I can see that the data consists of 300 observations and 36 variables. Apart from the variable “age”, all the variables are categorical with 2 to 7 levels.

I also make a graphical overview of the data.

With MCA I can analyse the patterns of relationships between several categorical variables. MCA is designed for categorical variables, but continuous ones can be included as background (supplementary) variables.

For the categorical variables, I can use either the indicator matrix or the Burt matrix in the analysis. The indicator matrix contains all the levels of the categorical variables as binary variables (1 = the observation belongs to the category, 0 = it does not). The Burt matrix is the matrix of all two-way cross-tabulations between the variables in the dataset.
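To make this concrete, here is a small sketch (with a hypothetical toy data frame, not part of my tea analysis) of how the two matrices are built: `tab.disjonctif()` from FactoMineR returns the indicator matrix, and the Burt matrix is its cross-product.

```r
library(FactoMineR)

# Toy data frame with two categorical variables (hypothetical example)
toy <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                  sex   = factor(c("F", "M", "F")))

# Indicator matrix: one binary column per factor level
Z <- tab.disjonctif(toy)

# Burt matrix: all two-way cross-tabulations between the variables
B <- t(Z) %*% Z
```

Each row of the indicator matrix sums to the number of variables, since every observation belongs to exactly one level of each variable.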

For the further analysis I will not use all 36 variables. Let’s choose some of them and make a summary and a graphical overview.

# Choose the columns to keep
library(dplyr)
library(tidyverse)
library(FactoMineR)

keep <- c("Tea", "How", "sugar", "sex", "age_Q", "breakfast", "work", "price")

# Make a new data set with the selected columns
tea1 <- tea[, keep]
library(GGally)
library(ggplot2)

# Summary and visualisation of the tea data with my chosen variables
summary(tea1)
##         Tea         How           sugar     sex       age_Q   
##  black    : 74   alone:195   No.sugar:155   F:178   15-24:92  
##  Earl Grey:193   lemon: 33   sugar   :145   M:122   25-34:69  
##  green    : 33   milk : 63                          35-44:40  
##                  other:  9                          45-59:61  
##                                                     +60  :38  
##                                                               
##          breakfast         work                 price    
##  breakfast    :144   Not.work:213   p_branded      : 95  
##  Not.breakfast:156   work    : 87   p_cheap        :  7  
##                                     p_private label: 21  
##                                     p_unknown      : 12  
##                                     p_upscale      : 53  
##                                     p_variable     :112
gather(tea1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust =1, size =8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

(This warning just means that gather() drops the factor attributes when stacking the columns for plotting; it does not affect the bar plots.)

# Multiple correspondence analysis
mca <- MCA(tea1, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea1, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.219   0.185   0.182   0.162   0.154   0.138
## % of var.              9.737   8.231   8.068   7.191   6.831   6.137
## Cumulative % of var.   9.737  17.968  26.036  33.228  40.058  46.195
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.135   0.129   0.127   0.118   0.109   0.109
## % of var.              6.009   5.712   5.653   5.231   4.866   4.835
## Cumulative % of var.  52.203  57.916  63.568  68.799  73.666  78.501
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.098   0.092   0.089   0.071   0.068   0.065
## % of var.              4.366   4.102   3.947   3.173   3.035   2.876
## Cumulative % of var.  82.867  86.969  90.916  94.089  97.124 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1         |  0.135  0.028  0.004 |  0.235  0.099  0.012 |  0.882  1.428
## 2         |  0.561  0.479  0.162 |  0.751  1.014  0.290 |  0.022  0.001
## 3         | -0.086  0.011  0.005 |  0.591  0.629  0.239 | -0.418  0.321
## 4         | -0.462  0.325  0.192 | -0.469  0.395  0.198 | -0.087  0.014
## 5         |  0.122  0.023  0.011 |  0.262  0.124  0.052 | -0.029  0.002
## 6         | -0.291  0.129  0.033 | -0.279  0.140  0.031 | -0.188  0.065
## 7         |  0.061  0.006  0.002 |  0.309  0.172  0.058 |  0.143  0.038
## 8         |  0.507  0.392  0.115 |  0.556  0.556  0.138 |  0.057  0.006
## 9         |  0.408  0.254  0.069 |  0.340  0.208  0.048 |  0.615  0.695
## 10        |  0.808  0.993  0.280 |  0.220  0.087  0.021 |  0.407  0.304
##             cos2  
## 1          0.163 |
## 2          0.000 |
## 3          0.120 |
## 4          0.007 |
## 5          0.001 |
## 6          0.014 |
## 7          0.012 |
## 8          0.001 |
## 9          0.156 |
## 10         0.071 |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black     |   1.097  16.950   0.394  10.859 |   0.336   1.881   0.037
## Earl Grey |  -0.532  10.397   0.511 -12.360 |   0.056   0.135   0.006
## green     |   0.652   2.666   0.052   3.962 |  -1.080   8.663   0.144
## alone     |  -0.081   0.243   0.012  -1.908 |  -0.127   0.712   0.030
## lemon     |  -0.121   0.092   0.002  -0.736 |  -0.564   2.366   0.039
## milk      |   0.053   0.034   0.001   0.471 |   0.566   4.545   0.085
## other     |   1.829   5.723   0.103   5.561 |   0.866   1.519   0.023
## No.sugar  |   0.462   6.300   0.228   8.265 |   0.403   5.664   0.174
## sugar     |  -0.494   6.734   0.228  -8.265 |  -0.431   6.055   0.174
## F         |  -0.049   0.080   0.003  -1.015 |   0.288   3.323   0.121
##            v.test     Dim.3     ctr    cos2  v.test  
## black       3.326 |   0.337   1.930   0.037   3.335 |
## Earl Grey   1.296 |  -0.085   0.317   0.013  -1.965 |
## green      -6.566 |  -0.261   0.516   0.008  -1.587 |
## alone      -3.002 |  -0.349   5.462   0.227  -8.232 |
## lemon      -3.431 |   0.555   2.333   0.038   3.374 |
## milk        5.048 |   0.781   8.825   0.162   6.965 |
## other       2.634 |   0.066   0.009   0.000   0.200 |
## No.sugar    7.205 |  -0.319   3.620   0.109  -5.703 |
## sugar      -7.205 |   0.341   3.870   0.109   5.703 |
## F           6.016 |  -0.559  12.787   0.457 -11.685 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## Tea       | 0.526 0.158 0.040 |
## How       | 0.107 0.135 0.241 |
## sugar     | 0.228 0.174 0.109 |
## sex       | 0.003 0.121 0.457 |
## age_Q     | 0.550 0.332 0.313 |
## breakfast | 0.000 0.173 0.055 |
## work      | 0.097 0.325 0.056 |
## price     | 0.241 0.063 0.181 |
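The eigenvalues above show that the variance is spread quite evenly across the dimensions: the first dimension captures only about 9.7% and the second about 8.2%. A quick scree plot (a sketch using the `mca` object fitted above) makes this easy to see:

```r
# Scree plot: percentage of variance explained by each MCA dimension
barplot(mca$eig[, "percentage of variance"],
        names.arg = seq_len(nrow(mca$eig)),
        xlab = "Dimension", ylab = "% of variance")
```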
# Visualize the MCA: variable categories only (individuals hidden)
plot(mca, invisible=c("ind"), habillage = "quali")

Interpretation of the MCA

In the MCA factor map, the distance between points gives a measure of their similarity (or dissimilarity). Younger people (age categories 15-24 and 25-34) tend to take their tea with sugar and at times other than breakfast, whereas people aged 35-44 and 45-59 prefer their tea without sugar. The first two dimensions together explain only about 18% of the total variance, so these patterns should be interpreted with some caution.
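To examine a pattern like the age/sugar one more closely, the individuals can also be drawn coloured by chosen variables, with confidence ellipses around the category means. A sketch using FactoMineR’s plotellipses() on the fitted mca object:

```r
# Individuals coloured by sugar and age_Q, with confidence ellipses
# around the barycentre of each category
plotellipses(mca, keepvar = c("sugar", "age_Q"))
```

If the ellipses of two categories overlap heavily, their mean positions on the map are not clearly separated.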